Sales forecasting¶
In this project we are going to build a model for forecasting sales of different items from different stores. Our dataset contains the sales data for 10 stores and 50 items.
Data set:¶
Data fields:
date - Date of the sale. There are no holiday effects or store closures.
store - Store ID
item - Item ID
sales - Number of items sold at a particular store on a particular date.
Data Source¶
https://www.kaggle.com/competitions/demand-forecasting-kernels-only/data
## Change the working directory
import os
os.chdir("E:/Time series/Fine Grained Demand Forecasting")
print(os.getcwd())
E:\Time series\Fine Grained Demand Forecasting
## getting the dataset
import pandas as pd
df = pd.read_csv('train.csv')
df.head()
| date | store | item | sales | |
|---|---|---|---|---|
| 0 | 2013-01-01 | 1 | 1 | 13 |
| 1 | 2013-01-02 | 1 | 1 | 11 |
| 2 | 2013-01-03 | 1 | 1 | 14 |
| 3 | 2013-01-04 | 1 | 1 | 13 |
| 4 | 2013-01-05 | 1 | 1 | 10 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 913000 entries, 0 to 912999
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   date    913000 non-null  object
 1   store   913000 non-null  int64
 2   item    913000 non-null  int64
 3   sales   913000 non-null  int64
dtypes: int64(3), object(1)
memory usage: 27.9+ MB
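Since `date` is read in as `object`, it can also be parsed at load time with `parse_dates`; a minimal sketch using a hypothetical two-row inline sample in the same layout as train.csv:

```python
import pandas as pd
from io import StringIO

# Hypothetical two-row sample with the same columns as train.csv
sample = StringIO("date,store,item,sales\n2013-01-01,1,1,13\n2013-01-02,1,1,11\n")
df = pd.read_csv(sample, parse_dates=["date"])  # parse the date column on read
print(df["date"].dtype)  # datetime64[ns] instead of object
```

This avoids the separate `pd.to_datetime` conversion step later on.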
## Sorting the dataset according to store number and item number
df_sorted = df.sort_values(by=['store', 'item'], ascending=[True, True])
df_sorted
| date | store | item | sales | |
|---|---|---|---|---|
| 0 | 2013-01-01 | 1 | 1 | 13 |
| 1 | 2013-01-02 | 1 | 1 | 11 |
| 2 | 2013-01-03 | 1 | 1 | 14 |
| 3 | 2013-01-04 | 1 | 1 | 13 |
| 4 | 2013-01-05 | 1 | 1 | 10 |
| ... | ... | ... | ... | ... |
| 912995 | 2017-12-27 | 10 | 50 | 63 |
| 912996 | 2017-12-28 | 10 | 50 | 59 |
| 912997 | 2017-12-29 | 10 | 50 | 74 |
| 912998 | 2017-12-30 | 10 | 50 | 62 |
| 912999 | 2017-12-31 | 10 | 50 | 82 |
913000 rows × 4 columns
## Create a dataframe for store 1 (the first 91300 rows: 50 items x 1826 days)
df_1 = df_sorted.iloc[:91300]
df_1
| date | store | item | sales | |
|---|---|---|---|---|
| 0 | 2013-01-01 | 1 | 1 | 13 |
| 1 | 2013-01-02 | 1 | 1 | 11 |
| 2 | 2013-01-03 | 1 | 1 | 14 |
| 3 | 2013-01-04 | 1 | 1 | 13 |
| 4 | 2013-01-05 | 1 | 1 | 10 |
| ... | ... | ... | ... | ... |
| 896561 | 2017-12-27 | 1 | 50 | 38 |
| 896562 | 2017-12-28 | 1 | 50 | 52 |
| 896563 | 2017-12-29 | 1 | 50 | 59 |
| 896564 | 2017-12-30 | 1 | 50 | 66 |
| 896565 | 2017-12-31 | 1 | 50 | 45 |
91300 rows × 4 columns
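Slicing the first 91300 rows works here because the frame is sorted by store, but filtering on the `store` column directly is more robust if the row counts ever change; a sketch on a toy stand-in frame:

```python
import pandas as pd

# Toy stand-in for df_sorted (hypothetical values, two stores)
df_sorted = pd.DataFrame({
    "date":  ["2013-01-01"] * 4,
    "store": [1, 1, 2, 2],
    "item":  [1, 2, 1, 2],
    "sales": [13, 33, 12, 30],
})
# Boolean filter: keeps exactly the store-1 rows regardless of position
df_1 = df_sorted[df_sorted["store"] == 1]
print(len(df_1))  # 2
```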
# Pivot without aggregation (this puts the sales for each item into its own column, creating a multivariate time series)
pivot_df_1 = df_1.pivot(index='date', columns='item', values='sales')
## Rename columns to store1_item_1, ..., store1_item_50
pivot_df_1.columns = [f"store1_item_{item}" for item in pivot_df_1.columns]
# Reassign to df_1
df_1 = pivot_df_1
df_1.head()  # date is already the index
| item | store1_item_1 | store1_item_2 | store1_item_3 | store1_item_4 | store1_item_5 | store1_item_6 | store1_item_7 | store1_item_8 | store1_item_9 | store1_item_10 | ... | store1_item_41 | store1_item_42 | store1_item_43 | store1_item_44 | store1_item_45 | store1_item_46 | store1_item_47 | store1_item_48 | store1_item_49 | store1_item_50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| date | |||||||||||||||||||||
| 2013-01-01 | 13 | 33 | 15 | 10 | 11 | 31 | 25 | 33 | 18 | 37 | ... | 6 | 21 | 22 | 20 | 37 | 30 | 17 | 21 | 18 | 30 |
| 2013-01-02 | 11 | 43 | 30 | 11 | 6 | 36 | 23 | 37 | 23 | 34 | ... | 15 | 24 | 27 | 15 | 40 | 30 | 15 | 26 | 10 | 32 |
| 2013-01-03 | 14 | 23 | 14 | 8 | 8 | 18 | 34 | 38 | 25 | 32 | ... | 5 | 14 | 19 | 11 | 42 | 30 | 5 | 25 | 17 | 25 |
| 2013-01-04 | 13 | 18 | 10 | 19 | 9 | 19 | 36 | 54 | 22 | 45 | ... | 9 | 22 | 29 | 22 | 49 | 37 | 13 | 26 | 22 | 32 |
| 2013-01-05 | 10 | 34 | 23 | 12 | 8 | 31 | 38 | 51 | 29 | 35 | ... | 13 | 18 | 34 | 19 | 52 | 28 | 12 | 28 | 15 | 35 |
5 rows × 50 columns
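The same pivot-and-rename can be generalized to all 10 stores with a `groupby`; a sketch on a toy long-format frame (hypothetical values):

```python
import pandas as pd

# Toy long-format frame: two stores, two items, two dates
df = pd.DataFrame({
    "date":  ["2013-01-01", "2013-01-02"] * 4,
    "store": [1, 1, 1, 1, 2, 2, 2, 2],
    "item":  [1, 1, 2, 2] * 2,
    "sales": [13, 11, 33, 43, 12, 10, 30, 40],
})
# One wide frame per store, with columns named store<s>_item_<i>
store_frames = {
    s: g.pivot(index="date", columns="item", values="sales").add_prefix(f"store{s}_item_")
    for s, g in df.groupby("store")
}
print(list(store_frames[1].columns))  # ['store1_item_1', 'store1_item_2']
```

`add_prefix` handles the renaming in one step, so the counter loop is not needed.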
## Changing the index to datetime
df_1.index = pd.to_datetime(df_1.index)
df_1.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1826 entries, 2013-01-01 to 2017-12-31
Data columns (total 50 columns): store1_item_1 through store1_item_50, each 1826 non-null int64
dtypes: int64(50)
memory usage: 727.5 KB
# Checking the missing values
df_1.isna().sum()
item
store1_item_1     0
store1_item_2     0
...
store1_item_50    0
dtype: int64
print(df_1.index.min())
print(df_1.index.max())
2013-01-01 00:00:00
2017-12-31 00:00:00
df_1["store1_item_1"].max()
50
Plotting Time Series¶
We are creating time series plots for only the first five items; putting all 50 graphs together would be messy.
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Define number of rows and columns for the subplots
rows, cols = 5, 1 # 5x1 grid, one row per item
# Create subplots
fig = make_subplots(rows=rows, cols=cols, shared_xaxes=False, shared_yaxes=False,
vertical_spacing=0.1, horizontal_spacing=0.1,
subplot_titles=df_1.columns[0:5])
# Add each variable to a separate subplot
for i, column in enumerate(df_1.columns[0:5]):
row = (i // cols) + 1
col = (i % cols) + 1
fig.add_trace(go.Scatter(x=df_1.index, y=df_1[column], mode='lines', name=column), row=row, col=col)
# Customize layout
fig.update_layout(
title="Multivariable Time Series with Grid Subplots",
xaxis_title="Date",
height=1300, # Adjust height for overall layout
showlegend=False,
template="plotly_dark"
)
# Customize y-axis titles for each subplot
for i, column in enumerate(df_1.columns[0:5], start=1):
fig['layout'][f'yaxis{i}']['title'] = column
# Show the figure
fig.show()
Seasonal Decomposition¶
Seasonal decomposition separates a time series into three components: Trend, Seasonal, and Residual (Error). For multiple time series variables, this can be done by decomposing each variable individually and then plotting these components side by side in Plotly.
To perform seasonal decomposition in Python, use the seasonal_decompose function from the statsmodels library and visualize each component in a Plotly subplot. We consider weekly, monthly, and quarterly periods.
## Seasonal Decomposition with weekly period
from statsmodels.tsa.seasonal import seasonal_decompose
# Dictionary to store decomposition results
decomposed_data = {}
# Perform seasonal decomposition for each variable
for column in df_1.columns[:5]:
decomposed = seasonal_decompose(df_1[column], model='additive', period=7)
decomposed_data[column] = decomposed
# Set up the subplot grid
rows = 5 * 3 # 3 rows per variable (Trend, Seasonal, Residual)
cols = 1 # Only one column
# Create subplots
fig = make_subplots(
rows=rows, cols=cols, shared_xaxes=False,
vertical_spacing=0.02,
subplot_titles=[f"{column} - {component}" for column in df_1.columns[ : 5] for component in ['Trend', 'Seasonal', 'Residual']]
)
# Add decomposed components to subplots
for var_idx, column in enumerate(df_1.columns[ : 5]):
decomposed = decomposed_data[column]
# Calculate the starting row for each variable
start_row = var_idx * 3 + 1
# Original time series
#fig.add_trace(go.Scatter(x=df_1.index, y=df_1[column], mode='lines', name=f"{column}"),row=start_row, col=1)
# Trend component
fig.add_trace(go.Scatter(x=df_1.index, y=decomposed.trend, mode='lines', name=f"{column} Trend"),
row=start_row, col=1)
# Seasonal component
fig.add_trace(go.Scatter(x=df_1.index, y=decomposed.seasonal, mode='lines', name=f"{column} Seasonal"),
row=start_row + 1, col=1)
# Residual component
fig.add_trace(go.Scatter(x=df_1.index, y=decomposed.resid, mode='lines', name=f"{column} Residual"),
row=start_row + 2, col=1)
# Customize layout
fig.update_layout(
title="Seasonal Decomposition of Multiple Items of Store 1",
height=700 * len(df_1.columns[ : 5]), # Adjust height based on number of variables
template="plotly_dark",
showlegend=False
)
# Customize y-axis titles for each subplot
for var_idx, column in enumerate(df_1.columns[ : 5]):
start_row = var_idx * 3 + 1
#fig.update_yaxes(title_text="column", row=start_row, col=1)
fig.update_yaxes(title_text="Trend", row=start_row, col=1)
fig.update_yaxes(title_text="Seasonal", row=start_row + 1, col=1)
fig.update_yaxes(title_text="Residual", row=start_row + 2, col=1)
# Show the figure
fig.show()
decomposed.seasonal
date
2013-01-01 -1.374647
2013-01-02 -1.078493
2013-01-03 0.411617
2013-01-04 0.686892
2013-01-05 1.922057
...
2017-12-27 -1.078493
2017-12-28 0.411617
2017-12-29 0.686892
2017-12-30 1.922057
2017-12-31 2.991287
Name: seasonal, Length: 1826, dtype: float64
decomposed.trend
date
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 NaN
2013-01-04 9.428571
2013-01-05 9.285714
...
2017-12-27 13.857143
2017-12-28 14.142857
2017-12-29 NaN
2017-12-30 NaN
2017-12-31 NaN
Name: trend, Length: 1826, dtype: float64
decomposed.resid
date
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 NaN
2013-01-04 -1.115463
2013-01-05 -3.207771
...
2017-12-27 -6.778650
2017-12-28 0.445526
2017-12-29 NaN
2017-12-30 NaN
2017-12-31 NaN
Name: resid, Length: 1826, dtype: float64
df_reconstructed = pd.concat([decomposed.seasonal, decomposed.trend, decomposed.resid, decomposed.observed], axis = 1)
df_reconstructed.columns = ['seasonal', 'trend', 'resid', 'actual_values']
df_reconstructed.head(50)
| seasonal | trend | resid | actual_values | |
|---|---|---|---|---|
| date | ||||
| 2013-01-01 | -1.374647 | NaN | NaN | 11.0 |
| 2013-01-02 | -1.078493 | NaN | NaN | 6.0 |
| 2013-01-03 | 0.411617 | NaN | NaN | 8.0 |
| 2013-01-04 | 0.686892 | 9.428571 | -1.115463 | 9.0 |
| 2013-01-05 | 1.922057 | 9.285714 | -3.207771 | 8.0 |
| 2013-01-06 | 2.991287 | 9.428571 | 0.580141 | 13.0 |
| 2013-01-07 | -3.558713 | 9.428571 | 5.130141 | 11.0 |
| 2013-01-08 | -1.374647 | 9.571429 | 1.803218 | 10.0 |
| 2013-01-09 | -1.078493 | 9.571429 | -1.492936 | 7.0 |
| 2013-01-10 | 0.411617 | 9.285714 | -1.697331 | 8.0 |
| 2013-01-11 | 0.686892 | 8.428571 | 0.884537 | 10.0 |
| 2013-01-12 | 1.922057 | 8.428571 | -2.350628 | 8.0 |
| 2013-01-13 | 2.991287 | 8.285714 | -0.277002 | 11.0 |
| 2013-01-14 | -3.558713 | 8.142857 | 0.415856 | 5.0 |
| 2013-01-15 | -1.374647 | 8.142857 | 3.231790 | 10.0 |
| 2013-01-16 | -1.078493 | 8.142857 | -1.064364 | 6.0 |
| 2013-01-17 | 0.411617 | 7.857143 | -1.268760 | 7.0 |
| 2013-01-18 | 0.686892 | 7.714286 | 1.598823 | 10.0 |
| 2013-01-19 | 1.922057 | 7.571429 | -1.493485 | 8.0 |
| 2013-01-20 | 2.991287 | 7.571429 | -1.562716 | 9.0 |
| 2013-01-21 | -3.558713 | 8.285714 | -0.727002 | 4.0 |
| 2013-01-22 | -1.374647 | 7.571429 | 2.803218 | 9.0 |
| 2013-01-23 | -1.078493 | 7.428571 | -0.350078 | 6.0 |
| 2013-01-24 | 0.411617 | 8.714286 | 2.874097 | 12.0 |
| 2013-01-25 | 0.686892 | 9.428571 | -5.115463 | 5.0 |
| 2013-01-26 | 1.922057 | 9.428571 | -4.350628 | 7.0 |
| 2013-01-27 | 2.991287 | 9.857143 | 5.151570 | 18.0 |
| 2013-01-28 | -3.558713 | 10.000000 | 2.558713 | 9.0 |
| 2013-01-29 | -1.374647 | 10.428571 | -0.053925 | 9.0 |
| 2013-01-30 | -1.078493 | 11.714286 | -1.635793 | 9.0 |
| 2013-01-31 | 0.411617 | 10.714286 | 1.874097 | 13.0 |
| 2013-02-01 | 0.686892 | 10.571429 | -3.258320 | 8.0 |
| 2013-02-02 | 1.922057 | 11.428571 | 2.649372 | 16.0 |
| 2013-02-03 | 2.991287 | 11.142857 | -3.134144 | 11.0 |
| 2013-02-04 | -3.558713 | 10.714286 | 0.844427 | 8.0 |
| 2013-02-05 | -1.374647 | 10.571429 | 5.803218 | 15.0 |
| 2013-02-06 | -1.078493 | 9.428571 | -1.350078 | 7.0 |
| 2013-02-07 | 0.411617 | 9.285714 | 0.302669 | 10.0 |
| 2013-02-08 | 0.686892 | 10.000000 | -3.686892 | 7.0 |
| 2013-02-09 | 1.922057 | 9.714286 | -3.636342 | 8.0 |
| 2013-02-10 | 2.991287 | 9.857143 | -2.848430 | 10.0 |
| 2013-02-11 | -3.558713 | 9.857143 | 6.701570 | 13.0 |
| 2013-02-12 | -1.374647 | 10.428571 | 3.946075 | 13.0 |
| 2013-02-13 | -1.078493 | 10.714286 | -1.635793 | 8.0 |
| 2013-02-14 | 0.411617 | 10.857143 | -1.268760 | 10.0 |
| 2013-02-15 | 0.686892 | 9.714286 | 0.598823 | 11.0 |
| 2013-02-16 | 1.922057 | 9.285714 | -1.207771 | 10.0 |
| 2013-02-17 | 2.991287 | 9.142857 | -1.134144 | 11.0 |
| 2013-02-18 | -3.558713 | 9.285714 | -0.727002 | 5.0 |
| 2013-02-19 | -1.374647 | 9.857143 | 1.517504 | 10.0 |
## Seasonal Decomposition with monthly period
from statsmodels.tsa.seasonal import seasonal_decompose
# Dictionary to store decomposition results
decomposed_data = {}
# Perform seasonal decomposition for each variable
for column in df_1.columns[:5]:
decomposed = seasonal_decompose(df_1[column], model='additive', period=31)
decomposed_data[column] = decomposed
# Set up the subplot grid
rows = 5 * 3 # 3 rows per variable (Trend, Seasonal, Residual)
cols = 1 # Only one column
# Create subplots
fig = make_subplots(
rows=rows, cols=cols, shared_xaxes=False,
vertical_spacing=0.02,
subplot_titles=[f"{column} - {component}" for column in df_1.columns[ : 5] for component in ['Trend', 'Seasonal', 'Residual']]
)
# Add decomposed components to subplots
for var_idx, column in enumerate(df_1.columns[ : 5]):
decomposed = decomposed_data[column]
# Calculate the starting row for each variable
start_row = var_idx * 3 + 1
# Original time series
#fig.add_trace(go.Scatter(x=df_1.index, y=df_1[column], mode='lines', name=f"{column}"),row=start_row, col=1)
# Trend component
fig.add_trace(go.Scatter(x=df_1.index, y=decomposed.trend, mode='lines', name=f"{column} Trend"),
row=start_row, col=1)
# Seasonal component
fig.add_trace(go.Scatter(x=df_1.index, y=decomposed.seasonal, mode='lines', name=f"{column} Seasonal"),
row=start_row + 1, col=1)
# Residual component
fig.add_trace(go.Scatter(x=df_1.index, y=decomposed.resid, mode='lines', name=f"{column} Residual"),
row=start_row + 2, col=1)
# Customize layout
fig.update_layout(
title="Seasonal Decomposition of Multiple Items of Store 1",
height=700 * len(df_1.columns[ : 5]), # Adjust height based on number of variables
template="plotly_dark",
showlegend=False
)
# Customize y-axis titles for each subplot
for var_idx, column in enumerate(df_1.columns[ : 5]):
start_row = var_idx * 3 + 1
#fig.update_yaxes(title_text="column", row=start_row, col=1)
fig.update_yaxes(title_text="Trend", row=start_row, col=1)
fig.update_yaxes(title_text="Seasonal", row=start_row + 1, col=1)
fig.update_yaxes(title_text="Residual", row=start_row + 2, col=1)
# Show the figure
fig.show()
df_reconstructed = pd.concat([decomposed.seasonal, decomposed.trend, decomposed.resid, decomposed.observed], axis = 1)
df_reconstructed.columns = ['seasonal', 'trend', 'resid', 'actual_values']
df_reconstructed.head(50)
| seasonal | trend | resid | actual_values | |
|---|---|---|---|---|
| date | ||||
| 2013-01-01 | 0.000462 | NaN | NaN | 11.0 |
| 2013-01-02 | -0.553486 | NaN | NaN | 6.0 |
| 2013-01-03 | 0.340841 | NaN | NaN | 8.0 |
| 2013-01-04 | -0.675288 | NaN | NaN | 9.0 |
| 2013-01-05 | 0.474322 | NaN | NaN | 8.0 |
| 2013-01-06 | 0.002687 | NaN | NaN | 13.0 |
| 2013-01-07 | -0.309326 | NaN | NaN | 11.0 |
| 2013-01-08 | 0.222932 | NaN | NaN | 10.0 |
| 2013-01-09 | -0.015110 | NaN | NaN | 7.0 |
| 2013-01-10 | -0.589638 | NaN | NaN | 8.0 |
| 2013-01-11 | -0.284855 | NaN | NaN | 10.0 |
| 2013-01-12 | -0.321562 | NaN | NaN | 8.0 |
| 2013-01-13 | 0.191230 | NaN | NaN | 11.0 |
| 2013-01-14 | 0.368971 | NaN | NaN | 5.0 |
| 2013-01-15 | -0.144894 | NaN | NaN | 10.0 |
| 2013-01-16 | 0.717926 | 8.903226 | -3.621152 | 6.0 |
| 2013-01-17 | 0.803021 | 8.806452 | -2.609472 | 7.0 |
| 2013-01-18 | -0.530127 | 9.129032 | 1.401095 | 10.0 |
| 2013-01-19 | 1.103911 | 9.225806 | -2.329717 | 8.0 |
| 2013-01-20 | -0.275956 | 9.193548 | 0.082407 | 9.0 |
| 2013-01-21 | -0.033464 | 9.419355 | -5.385891 | 4.0 |
| 2013-01-22 | 0.241842 | 9.225806 | -0.467648 | 9.0 |
| 2013-01-23 | 0.237948 | 9.193548 | -3.431497 | 6.0 |
| 2013-01-24 | 0.530496 | 9.096774 | 2.372730 | 12.0 |
| 2013-01-25 | -0.748703 | 9.129032 | -3.380329 | 5.0 |
| 2013-01-26 | -0.460049 | 9.193548 | -1.733499 | 7.0 |
| 2013-01-27 | -0.114666 | 9.290323 | 8.824343 | 18.0 |
| 2013-01-28 | 0.376992 | 9.451613 | -0.828605 | 9.0 |
| 2013-01-29 | -0.698648 | 9.354839 | 0.343809 | 9.0 |
| 2013-01-30 | -0.066835 | 9.516129 | -0.449294 | 9.0 |
| 2013-01-31 | 0.209027 | 9.548387 | 3.242585 | 13.0 |
| 2013-02-01 | 0.000462 | 9.677419 | -1.677882 | 8.0 |
| 2013-02-02 | -0.553486 | 9.806452 | 6.747035 | 16.0 |
| 2013-02-03 | 0.340841 | 9.645161 | 1.013998 | 11.0 |
| 2013-02-04 | -0.675288 | 9.709677 | -1.034389 | 8.0 |
| 2013-02-05 | 0.474322 | 9.645161 | 4.880516 | 15.0 |
| 2013-02-06 | 0.002687 | 9.870968 | -2.873655 | 7.0 |
| 2013-02-07 | -0.309326 | 10.064516 | 0.244810 | 10.0 |
| 2013-02-08 | 0.222932 | 10.290323 | -3.513254 | 7.0 |
| 2013-02-09 | -0.015110 | 10.193548 | -2.178438 | 8.0 |
| 2013-02-10 | -0.589638 | 10.129032 | 0.460605 | 10.0 |
| 2013-02-11 | -0.284855 | 10.096774 | 3.188080 | 13.0 |
| 2013-02-12 | -0.321562 | 9.806452 | 3.515110 | 13.0 |
| 2013-02-13 | 0.191230 | 9.903226 | -2.094456 | 8.0 |
| 2013-02-14 | 0.368971 | 9.967742 | -0.336713 | 10.0 |
| 2013-02-15 | -0.144894 | 10.193548 | 0.951346 | 11.0 |
| 2013-02-16 | 0.717926 | 10.129032 | -0.846958 | 10.0 |
| 2013-02-17 | 0.803021 | 10.064516 | 0.132463 | 11.0 |
| 2013-02-18 | -0.530127 | 9.870968 | -4.340841 | 5.0 |
| 2013-02-19 | 1.103911 | 9.806452 | -0.910362 | 10.0 |
## Seasonal Decomposition with quarterly period
from statsmodels.tsa.seasonal import seasonal_decompose
# Dictionary to store decomposition results
decomposed_data = {}
# Perform seasonal decomposition for each variable
for column in df_1.columns[:5]:
    decomposed = seasonal_decompose(df_1[column], model='additive', period=91)  # ~91 days per quarter
decomposed_data[column] = decomposed
# Set up the subplot grid
rows = 5 * 3 # 3 rows per variable (Trend, Seasonal, Residual)
cols = 1 # Only one column
# Create subplots
fig = make_subplots(
rows=rows, cols=cols, shared_xaxes=False,
vertical_spacing=0.02,
subplot_titles=[f"{column} - {component}" for column in df_1.columns[ : 5] for component in ['Trend', 'Seasonal', 'Residual']]
)
# Add decomposed components to subplots
for var_idx, column in enumerate(df_1.columns[ : 5]):
decomposed = decomposed_data[column]
# Calculate the starting row for each variable
start_row = var_idx * 3 + 1
# Original time series
#fig.add_trace(go.Scatter(x=df_1.index, y=df_1[column], mode='lines', name=f"{column}"),row=start_row, col=1)
# Trend component
fig.add_trace(go.Scatter(x=df_1.index, y=decomposed.trend, mode='lines', name=f"{column} Trend"),
row=start_row, col=1)
# Seasonal component
fig.add_trace(go.Scatter(x=df_1.index, y=decomposed.seasonal, mode='lines', name=f"{column} Seasonal"),
row=start_row + 1, col=1)
# Residual component
fig.add_trace(go.Scatter(x=df_1.index, y=decomposed.resid, mode='lines', name=f"{column} Residual"),
row=start_row + 2, col=1)
# Customize layout
fig.update_layout(
title="Seasonal Decomposition of Multiple Items of Store 1",
height=700 * len(df_1.columns[ : 5]), # Adjust height based on number of variables
template="plotly_dark",
showlegend=False
)
# Customize y-axis titles for each subplot
for var_idx, column in enumerate(df_1.columns[ : 5]):
start_row = var_idx * 3 + 1
#fig.update_yaxes(title_text="column", row=start_row, col=1)
fig.update_yaxes(title_text="Trend", row=start_row, col=1)
fig.update_yaxes(title_text="Seasonal", row=start_row + 1, col=1)
fig.update_yaxes(title_text="Residual", row=start_row + 2, col=1)
# Show the figure
fig.show()
df_reconstructed = pd.concat([decomposed.seasonal, decomposed.trend, decomposed.resid, decomposed.observed], axis = 1)
df_reconstructed.columns = ['seasonal', 'trend', 'resid', 'actual_values']
df_reconstructed.head(50)
To compute and plot the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) in Python with Plotly, you can use statsmodels to get the ACF and PACF values and then plotly.graph_objects to create the plots.
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from statsmodels.tsa.stattools import acf, pacf
# Calculate ACF and PACF
acf_values = acf(df_1["store1_item_1"], nlags=365) # Adjust nlags as necessary
pacf_values = pacf(df_1["store1_item_1"], nlags=365)
# Create ACF plot
acf_fig = go.Figure()
acf_fig.add_trace(go.Scatter(x=np.arange(len(acf_values)), y=acf_values, name='ACF'))
acf_fig.update_layout(
title="Autocorrelation Function (ACF)",
xaxis_title="Lags",
yaxis_title="ACF",
template="plotly_white"
)
# Create PACF plot
pacf_fig = go.Figure()
pacf_fig.add_trace(go.Scatter(x=np.arange(len(pacf_values)), y=pacf_values, name='PACF'))
pacf_fig.update_layout(
title="Partial Autocorrelation Function (PACF)",
xaxis_title="Lags",
yaxis_title="PACF",
template="plotly_white"
)
# Show the plots
acf_fig.show()
pacf_fig.show()
def create_corr_plot(series, plot_pacf=False):
corr_array = pacf(series.dropna(), alpha=0.05) if plot_pacf else acf(series.dropna(), alpha=0.05)
lower_y = corr_array[1][:,0] - corr_array[0]
upper_y = corr_array[1][:,1] - corr_array[0]
fig = go.Figure()
[fig.add_scatter(x=(x,x), y=(0,corr_array[0][x]), mode='lines',line_color='#3f3f3f')
for x in range(len(corr_array[0]))]
fig.add_scatter(x=np.arange(len(corr_array[0])), y=corr_array[0], mode='markers', marker_color='#1f77b4',
marker_size=12)
fig.add_scatter(x=np.arange(len(corr_array[0])), y=upper_y, mode='lines', line_color='rgba(255,255,255,0)')
fig.add_scatter(x=np.arange(len(corr_array[0])), y=lower_y, mode='lines',fillcolor='rgba(32, 146, 230,0.3)',
fill='tonexty', line_color='rgba(255,255,255,0)')
fig.update_traces(showlegend=False)
fig.update_xaxes(range=[-1,42])
fig.update_yaxes(zerolinecolor='#000000')
title='Partial Autocorrelation (PACF)' if plot_pacf else 'Autocorrelation (ACF)'
fig.update_layout(title=title)
fig.show()
create_corr_plot(df_1["store1_item_1"], plot_pacf=False)
def create_corr_plot(series, plot_pacf=False, nlags=None):
corr_array = pacf(series.dropna(), alpha=0.05,nlags=nlags) if plot_pacf else acf(series.dropna(), alpha=0.05,nlags=nlags)
lower_y = corr_array[1][:,0] - corr_array[0]
upper_y = corr_array[1][:,1] - corr_array[0]
fig = go.Figure()
[fig.add_scatter(x=(x,x), y=(0,corr_array[0][x]), mode='lines',line_color='#3f3f3f')
for x in range(len(corr_array[0]))]
fig.add_scatter(x=np.arange(len(corr_array[0])), y=corr_array[0], mode='markers', marker_color='#1f77b4',
marker_size=6)
fig.add_scatter(x=np.arange(len(corr_array[0])), y=upper_y, mode='lines', line_color='rgba(255,255,255,0)')
fig.add_scatter(x=np.arange(len(corr_array[0])), y=lower_y, mode='lines',fillcolor='rgba(32, 146, 230,0.3)',
fill='tonexty', line_color='rgba(255,255,255,0)')
fig.update_traces(showlegend=False)
fig.update_xaxes(range=[-1, nlags + 1])
fig.update_yaxes(zerolinecolor='#000000')
title='Partial Autocorrelation (PACF)' if plot_pacf else 'Autocorrelation (ACF)'
fig.update_layout(title=title)
fig.show()
create_corr_plot(df_1["store1_item_1"], plot_pacf=False,nlags=93)
create_corr_plot(df_1["store1_item_1"], plot_pacf=True,nlags=93)
create_corr_plot(df_1["store1_item_1"], plot_pacf=True,nlags=180)
The PACF cutting off around lag 7 suggests that an AR(7) model may be suitable for this time series, but first we need to check stationarity.
from statsmodels.tsa.stattools import adfuller
# Perform the ADF test
result = adfuller(df_1["store1_item_1"])
# Extract and display the test results
print("ADF Statistic:", result[0])
print("p-value:", result[1])
print("Critical Values:")
for key, value in result[4].items():
print(f" {key}: {value}")
# Interpret the p-value
if result[1] < 0.05:
print("The time series is likely stationary (reject H0).")
else:
print("The time series is likely non-stationary (fail to reject H0).")
ADF Statistic: -3.157670556332814
p-value: 0.022569380626570913
Critical Values:
   1%: -3.4339840952648695
   5%: -2.8631452508003057
   10%: -2.567624583142913
The time series is likely stationary (reject H0).
Check other items of store 1¶
create_corr_plot(df_1["store1_item_5"], plot_pacf=False,nlags=93)
create_corr_plot(df_1["store1_item_2"], plot_pacf=True,nlags=93)
from statsmodels.tsa.stattools import adfuller
# Perform the ADF test
result = adfuller(df_1["store1_item_2"])
# Extract and display the test results
print("ADF Statistic:", result[0])
print("p-value:", result[1])
print("Critical Values:")
for key, value in result[4].items():
print(f" {key}: {value}")
# Interpret the p-value
if result[1] < 0.05:
print("The time series is likely stationary (reject H0).")
else:
print("The time series is likely non-stationary (fail to reject H0).")
ADF Statistic: -3.163241216645516
p-value: 0.0222134781646119
Critical Values:
   1%: -3.4339840952648695
   5%: -2.8631452508003057
   10%: -2.567624583142913
The time series is likely stationary (reject H0).
Let's fit an AR(7) model.¶
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from statsmodels.tsa.ar_model import AutoReg
from sklearn.metrics import mean_squared_error
# Convert to pandas Series
data_series = pd.Series(df_1["store1_item_1"])
# Split the data into train and test sets
train_size = int(len(data_series) * 0.8)
train, test = data_series[:train_size], data_series[train_size:]
# Fit an AutoRegressive (AR) model
# Choose an appropriate lag value p; based on the PACF, we use p=7
model = AutoReg(train, lags=7).fit()
# Generate predictions
predictions = model.predict(start=len(train), end=len(data_series)-1, dynamic=False)
# Calculate error (optional)
error = mean_squared_error(test, predictions)
print(f"Test MSE: {error}")
# Plot the observed and predicted values
fig = go.Figure()
# Plot observed values
fig.add_trace(go.Scatter(x=data_series.index, y=data_series, mode='lines', name='Observed'))
# Plot predicted values
fig.add_trace(go.Scatter(x=test.index, y=predictions, mode='lines', name='Predicted'))
# Update plot layout
fig.update_layout(
title="Observed vs Predicted Time Series",
xaxis_title="Time",
yaxis_title="Value",
template="plotly_white"
)
fig.show()
import warnings
from statsmodels.tools.sm_exceptions import ValueWarning  # ValueWarning must be imported before use
warnings.filterwarnings("ignore", category=ValueWarning)
Test MSE: 54.081146963782
Use Prophet to predict the time series.¶
Facebook Prophet (now simply known as "Prophet") is an open-source forecasting tool developed by Facebook's Core Data Science team. Designed for time series forecasting, it is especially useful for data with strong seasonal effects and multiple seasonalities, such as daily, weekly, and yearly patterns. Prophet can incorporate holidays and special events, making it suitable for real-world business scenarios with irregular patterns. Its intuitive interface enables analysts and data scientists to quickly build robust forecasting models without deep knowledge of complex statistical modeling, and it is implemented in both Python and R, allowing seamless integration with popular data science workflows.
Here are the main default parameters in Facebook Prophet that influence model behavior and flexibility:
Growth: 'linear'
Determines the type of trend. Defaults to 'linear', but can be set to 'logistic' for saturating growth.
Seasonality:
Yearly Seasonality: Enabled by default with a fourier_order of 10.
Weekly Seasonality: Enabled by default with a fourier_order of 3.
Daily Seasonality: Disabled by default but can be enabled with a fourier_order of 4.
Holidays: Disabled by default
Users can specify holiday effects by providing a list of holiday dates.
Changepoint Range: 0.8
The proportion of data (from the start) in which Prophet will place potential changepoints. Default is 0.8, meaning changepoints are considered in the first 80% of the data.
Changepoint Prior Scale: 0.05
Controls the flexibility of the trend. Lower values (e.g., 0.01) create a smoother trend, while higher values (e.g., 0.5) allow more flexibility.
Interval Width: 0.80
Defines the uncertainty interval for forecasts. By default, it produces an 80% prediction interval.
Uncertainty Samples: 1000
The number of simulations Prophet runs to estimate forecast uncertainty.
Seasonality Mode: 'additive'
Can be set to 'additive' or 'multiplicative', depending on the data's seasonal behavior.
These default settings make Prophet a versatile tool for common forecasting tasks, but they can be fine-tuned to capture more complex time series behavior.
## Prophet with default seasonality (365 days)
import numpy as np
import pandas as pd
from prophet import Prophet
import plotly.graph_objects as go
from sklearn.metrics import mean_squared_error,mean_absolute_error, mean_absolute_percentage_error
# Create a DataFrame with columns 'ds' and 'y' for Prophet
df = pd.DataFrame({"ds": df_1.index, "y":df_1["store1_item_1"]})
# Split the data into train and test sets
train_size = int(len(df) * 0.8)
train_df = df[:train_size]
test_df = df[train_size:]
# Initialize and fit the Prophet model
model = Prophet(interval_width=0.95)  # default interval_width is 0.80
model.fit(train_df)
# Make future predictions
future = model.make_future_dataframe(periods=len(test_df), freq="D")
forecast = model.predict(future)
# Calculate the MSE for the test set
test_forecast = forecast[-len(test_df):]
error1 = mean_squared_error(test_df['y'], test_forecast['yhat'])
error2=mean_absolute_percentage_error(test_df['y'], test_forecast['yhat'])
print(f"Test MSE: {error1}")
print(f"Test MAPE: {error2}")
# Plot observed and predicted values
fig = go.Figure()
# Plot observed (train + test) values
fig.add_trace(go.Scatter(x=df['ds'], y=df['y'], mode='lines', name='Observed'))
# Plot predicted values
fig.add_trace(go.Scatter(x=test_forecast['ds'], y=test_forecast['yhat'], mode='lines', name='Predicted'))
# Update plot layout
fig.update_layout(
title="Observed vs Predicted Time Series (Prophet)",
xaxis_title="Date",
yaxis_title="Value",
template="plotly_white"
)
fig.show()
09:46:45 - cmdstanpy - INFO - Chain [1] start processing 09:46:45 - cmdstanpy - INFO - Chain [1] done processing
Test MSE: 25.119804306628804 Test MAPE: 0.22637318281731522
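A MAPE of about 0.226 means the forecasts deviate from the observed sales by roughly 22.6% on average. A quick sanity check of the metric's definition, using made-up numbers:

```python
y = [10, 20, 30]
yhat = [12, 18, 33]
# MAPE = mean of |y - yhat| / |y|
mape = sum(abs(a - p) / abs(a) for a, p in zip(y, yhat)) / len(y)
print(round(mape, 4))  # 0.1333, i.e. ~13.3% average relative error
```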
# Plot observed and predicted values
fig = go.Figure()
# Plot observed (train + test) values
fig.add_trace(go.Scatter(x=df[train_size:]['ds'], y=df[train_size:]['y'], mode='lines', name='Observed'))
# Plot predicted values
fig.add_trace(go.Scatter(x=test_forecast['ds'], y=test_forecast['yhat'], mode='lines', name='Predicted'))
# Update plot layout
fig.update_layout(
title="Observed vs Predicted Time Series (Prophet)",
xaxis_title="Date",
yaxis_title="Value",
template="plotly_white"
)
fig.show()
Improving a Prophet model for better time series forecasting involves several strategies, including tuning model parameters, adding additional regressors, adjusting seasonalities, and making use of domain-specific knowledge. Here are some effective approaches:
1. Tune the Growth Model¶
Prophet can handle linear and logistic growth models. By default, it uses linear growth, but if your data shows saturation effects, switching to a logistic growth model with a defined capacity can improve accuracy. Example:
model = Prophet(growth='logistic')
df['cap'] = 100 # Set the upper limit (capacity)
model.fit(df)
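Prophet's logistic trend follows the standard logistic curve, which saturates at the capacity `cap`. A minimal sketch of that saturating behaviour (the rate `k` and offset `m` below are illustrative values, not Prophet's fitted parameters):

```python
import math

def logistic_trend(t, cap=100.0, k=0.1, m=50.0):
    # Saturating growth: approaches `cap` as t grows, halfway point at t = m
    return cap / (1.0 + math.exp(-k * (t - m)))

print(round(logistic_trend(50), 1))   # 50.0 (half of capacity at t = m)
print(round(logistic_trend(200), 1))  # 100.0 (effectively saturated)
```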
2. Add Seasonality Components¶
By default, Prophet detects yearly seasonality (for datasets with sufficient data) and weekly seasonality. However, you can manually add additional seasonal components (e.g., quarterly or custom seasonalities) if your data exhibits more complex patterns.
model = Prophet()
model.add_seasonality(name='quarterly', period=90.5, fourier_order=8) # Adjust period and fourier_order as needed
model.fit(df)
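Under the hood, each seasonality is represented by a truncated Fourier series: `add_seasonality` expands time into `2 * fourier_order` sine/cosine columns and the model fits a coefficient to each. A minimal sketch of that expansion:

```python
import numpy as np

def fourier_features(t_days, period, order):
    """Build the 2*order sin/cos columns a Prophet-style seasonality uses."""
    t = np.asarray(t_days, dtype=float)
    cols = []
    for k in range(1, order + 1):
        x = 2.0 * np.pi * k * t / period
        cols.append(np.sin(x))
        cols.append(np.cos(x))
    return np.column_stack(cols)

X = fourier_features(np.arange(14), period=7.0, order=3)
print(X.shape)  # (14, 6): two columns (sin, cos) per Fourier term
```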
3. Use Additional Regressors¶
Prophet allows for adding external regressors that may influence your forecast. For example, adding information on holidays, special events, or external factors like temperature, promotions, or economic indicators can improve accuracy.
df['regressor'] = external_data # external_data is an array or series of additional data
model = Prophet()
model.add_regressor('regressor')
model.fit(df)
4. Incorporate Holiday Effects¶
Prophet has a built-in feature to account for holiday effects, which can be especially useful if the time series is affected by significant events or holidays. Prophet supports predefined holiday lists for various countries.
from prophet.make_holidays import make_holidays_df
holidays = make_holidays_df(year_list=[2020, 2021, 2022], country='US')
model = Prophet(holidays=holidays)
model.fit(df)
5. Adjust Changepoints and Flexibility¶
Changepoints allow the model to detect shifts in trend. Prophet automatically selects changepoints, but sometimes fine-tuning them helps. You can increase n_changepoints or adjust changepoint_prior_scale to allow more or less flexibility in trend shifts.
model = Prophet(changepoint_prior_scale=0.05) # Higher value increases flexibility
model.fit(df)
6. Increase Fourier Order for Seasonalities¶
Increasing the fourier_order for seasonalities allows the model to capture more complex seasonal patterns, but it also increases the risk of overfitting. Experiment with higher values if your seasonality has more intricate patterns.
model = Prophet(yearly_seasonality=False)
model.add_seasonality(name='yearly', period=365.25, fourier_order=20) # Custom yearly seasonality with higher complexity
model.fit(df)
7. Tune Seasonality and Trend Flexibility Using Cross-Validation¶
Prophet has a built-in cross-validation function that can help you assess the model’s performance with different tuning parameters and identify the best configuration.
from prophet.diagnostics import cross_validation, performance_metrics
df_cv = cross_validation(model, initial='730 days', period='180 days', horizon='365 days')
df_p = performance_metrics(df_cv)
print(df_p.head())
8. Increase Forecast Accuracy with Hyperparameter Tuning¶
Systematic tuning of changepoint_prior_scale, seasonality_prior_scale, holidays_prior_scale, and fourier_order for seasonalities can help find the best model. You could automate this using a search method (e.g., grid search) or use libraries like optuna for optimization.
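A simple grid search can be built with `itertools.product`. The sketch below only constructs the parameter grid; the `evaluate()` step in the comment is hypothetical and would fit one Prophet model per combination and record a cross-validated error:

```python
from itertools import product

param_grid = {
    "changepoint_prior_scale": [0.001, 0.01, 0.1, 0.5],
    "seasonality_prior_scale": [0.01, 0.1, 1.0, 10.0],
}
# One dict per parameter combination
all_params = [dict(zip(param_grid.keys(), v))
              for v in product(*param_grid.values())]
print(len(all_params))  # 16 candidate configurations
# for params in all_params:
#     error = evaluate(Prophet(**params))  # hypothetical: fit + cross_validation
```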
9. Ensure Quality of Data¶
Prophet is sensitive to outliers and missing data, which can negatively impact model performance. Handle any missing values, outliers, or irregular frequency before modeling. By carefully tuning the Prophet model using these strategies, you can often capture more of the inherent patterns in your data and improve forecast accuracy.
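Since Prophet simply skips rows where `y` is missing, a common cleaning trick is to blank out extreme values rather than drop the rows. A sketch using the classic 1.5*IQR rule (the toy series and the threshold are assumptions, tune them for your data):

```python
import pandas as pd

s = pd.Series([20, 22, 19, 21, 200, 23, 18])  # 200 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
cleaned = s.mask(outliers)  # outliers -> NaN; Prophet ignores NaN y values
print(int(cleaned.isna().sum()))  # 1
```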
1. Tune the Growth Model¶
# Create a DataFrame with columns 'ds' and 'y' for Prophet
df = pd.DataFrame({"ds": df_1.index, "y":df_1["store1_item_1"]})
df['cap'] = 50
# Split the data into train and test sets
train_size = int(len(df) * 0.8)
train_df = df[:train_size]
test_df = df[train_size:]
# Initialize and fit the Prophet model
model = Prophet(growth='logistic')
model.fit(train_df)
# Make future predictions
future = model.make_future_dataframe(periods=len(test_df), freq="D")
future['cap'] = 50
forecast = model.predict(future)
# Calculate the MSE for the test set
test_forecast = forecast[-len(test_df):]
error1 = mean_squared_error(test_df['y'], test_forecast['yhat'])
error2=mean_absolute_percentage_error(test_df['y'], test_forecast['yhat'])
print(f"Test MSE: {error1}")
print(f"Test MAPE: {error2}")
# Plot observed and predicted values
fig = go.Figure()
# Plot observed (train + test) values
fig.add_trace(go.Scatter(x=df['ds'], y=df['y'], mode='lines', name='Observed'))
# Plot predicted values
fig.add_trace(go.Scatter(x=test_forecast['ds'], y=test_forecast['yhat'], mode='lines', name='Predicted'))
# Update plot layout
fig.update_layout(
title="Observed vs Predicted Time Series (Prophet)",
xaxis_title="Date",
yaxis_title="Value",
template="plotly_white"
)
fig.show()
09:51:36 - cmdstanpy - INFO - Chain [1] start processing 09:51:36 - cmdstanpy - INFO - Chain [1] done processing
Test MSE: 26.17361803171227 Test MAPE: 0.23425785594156023
2. Add Seasonality Components¶
#### Yearly seasonality: period=365, fourier_order=8, growth='logistic'
# Create a DataFrame with columns 'ds' and 'y' for Prophet
df = pd.DataFrame({"ds": df_1.index, "y":df_1["store1_item_1"]})
df['cap'] = 50
# Split the data into train and test sets
train_size = int(len(df) * 0.8)
train_df = df[:train_size]
test_df = df[train_size:]
# Initialize and fit the Prophet model
model = Prophet(growth='logistic')
model.add_seasonality(name="yearly", period=365, fourier_order=8)  # Adjust period and fourier_order as needed
model.add_country_holidays("US")
model.fit(train_df)
# Make future predictions
future = model.make_future_dataframe(periods=len(test_df), freq="D")
future['cap'] = 50
forecast = model.predict(future)
# Calculate the MSE for the test set
test_forecast = forecast[-len(test_df):]
error1 = mean_squared_error(test_df['y'], test_forecast['yhat'])
error2=mean_absolute_percentage_error(test_df['y'], test_forecast['yhat'])
print(f"Test MSE: {error1}")
print(f"Test MAPE: {error2}")
# Plot observed and predicted values
fig = go.Figure()
# Plot observed (train + test) values
fig.add_trace(go.Scatter(x=df['ds'], y=df['y'], mode='lines', name='Observed'))
# Plot predicted values
fig.add_trace(go.Scatter(x=test_forecast['ds'], y=test_forecast['yhat'], mode='lines', name='Predicted'))
# Update plot layout
fig.update_layout(
title="Observed vs Predicted Time Series (Prophet)",
xaxis_title="Date",
yaxis_title="Value",
template="plotly_white"
)
fig.show()
09:52:02 - cmdstanpy - INFO - Chain [1] start processing 09:52:02 - cmdstanpy - INFO - Chain [1] done processing
Test MSE: 26.156016782911937 Test MAPE: 0.23506594627394364
# Plot observed and predicted values
fig = go.Figure()
# Plot observed (train + test) values
fig.add_trace(go.Scatter(x=df[train_size:]['ds'], y=df[train_size:]['y'], mode='lines', name='Observed'))
# Plot predicted values
fig.add_trace(go.Scatter(x=test_forecast['ds'], y=test_forecast['yhat'], mode='lines', name='Predicted'))
# Update plot layout
fig.update_layout(
title="Observed vs Predicted Time Series (Prophet)",
xaxis_title="Date",
yaxis_title="Value",
template="plotly_white"
)
fig.show()
#### Quarterly seasonality: period=93, fourier_order=8, growth='logistic'
# Create a DataFrame with columns 'ds' and 'y' for Prophet
df = pd.DataFrame({"ds": df_1.index, "y":df_1["store1_item_1"]})
df['cap'] = 50
# Split the data into train and test sets
train_size = int(len(df) * 0.8)
train_df = df[:train_size]
test_df = df[train_size:]
# Initialize and fit the Prophet model
model = Prophet(growth='logistic')
model.add_seasonality(name="quarterly", period=93, fourier_order=8)  # Adjust period and fourier_order as needed
model.add_country_holidays("US")
model.fit(train_df)
# Make future predictions
future = model.make_future_dataframe(periods=len(test_df), freq="D")
future['cap'] = 50
forecast = model.predict(future)
# Calculate the MSE for the test set
test_forecast = forecast[-len(test_df):]
error1 = mean_squared_error(test_df['y'], test_forecast['yhat'])
error2=mean_absolute_percentage_error(test_df['y'], test_forecast['yhat'])
print(f"Test MSE: {error1}")
print(f"Test MAPE: {error2}")
# Plot observed and predicted values
fig = go.Figure()
# Plot observed (train + test) values
fig.add_trace(go.Scatter(x=df['ds'], y=df['y'], mode='lines', name='Observed'))
# Plot predicted values
fig.add_trace(go.Scatter(x=test_forecast['ds'], y=test_forecast['yhat'], mode='lines', name='Predicted'))
# Update plot layout
fig.update_layout(
title="Observed vs Predicted Time Series (Prophet)",
xaxis_title="Date",
yaxis_title="Value",
template="plotly_white"
)
fig.show()
10:08:52 - cmdstanpy - INFO - Chain [1] start processing 10:08:52 - cmdstanpy - INFO - Chain [1] done processing
Test MSE: 27.130378392293093 Test MAPE: 0.23715099114755028
#### Monthly seasonality: period=31, fourier_order=5, growth='logistic'
# Create a DataFrame with columns 'ds' and 'y' for Prophet
df = pd.DataFrame({"ds": df_1.index, "y":df_1["store1_item_1"]})
df['cap'] = 50
# Split the data into train and test sets
train_size = int(len(df) * 0.8)
train_df = df[:train_size]
test_df = df[train_size:]
# Initialize and fit the Prophet model
model = Prophet(growth='logistic')
model.add_seasonality(name="monthly", period=31, fourier_order=5)  # Adjust period and fourier_order as needed
model.add_country_holidays("US")
model.fit(train_df)
# Make future predictions
future = model.make_future_dataframe(periods=len(test_df), freq="D")
future['cap'] = 50
forecast = model.predict(future)
# Calculate the MSE for the test set
test_forecast = forecast[-len(test_df):]
error1 = mean_squared_error(test_df['y'], test_forecast['yhat'])
error2=mean_absolute_percentage_error(test_df['y'], test_forecast['yhat'])
print(f"Test MSE: {error1}")
print(f"Test MAPE: {error2}")
# Plot observed and predicted values
fig = go.Figure()
# Plot observed (train + test) values
fig.add_trace(go.Scatter(x=df['ds'], y=df['y'], mode='lines', name='Observed'))
# Plot predicted values
fig.add_trace(go.Scatter(x=test_forecast['ds'], y=test_forecast['yhat'], mode='lines', name='Predicted'))
# Update plot layout
fig.update_layout(
title="Observed vs Predicted Time Series (Prophet)",
xaxis_title="Date",
yaxis_title="Value",
template="plotly_white"
)
fig.show()
10:11:17 - cmdstanpy - INFO - Chain [1] start processing 10:11:17 - cmdstanpy - INFO - Chain [1] done processing
Test MSE: 26.39701071627797 Test MAPE: 0.23503769651039932
#### Weekly seasonality: period=7, fourier_order=5, growth='logistic'
# Create a DataFrame with columns 'ds' and 'y' for Prophet
df = pd.DataFrame({"ds": df_1.index, "y":df_1["store1_item_1"]})
df['cap'] = 50
# Split the data into train and test sets
train_size = int(len(df) * 0.8)
train_df = df[:train_size]
test_df = df[train_size:]
# Initialize and fit the Prophet model
model = Prophet(growth='logistic')
model.add_seasonality(name="weekly", period=7, fourier_order=5)  # Adjust period and fourier_order as needed
model.add_country_holidays("US")
model.fit(train_df)
# Make future predictions
future = model.make_future_dataframe(periods=len(test_df), freq="D")
future['cap'] = 50
forecast = model.predict(future)
# Calculate the MSE for the test set
test_forecast = forecast[-len(test_df):]
error1 = mean_squared_error(test_df['y'], test_forecast['yhat'])
error2=mean_absolute_percentage_error(test_df['y'], test_forecast['yhat'])
print(f"Test MSE: {error1}")
print(f"Test MAPE: {error2}")
# Plot observed and predicted values
fig = go.Figure()
# Plot observed (train + test) values
fig.add_trace(go.Scatter(x=df['ds'], y=df['y'], mode='lines', name='Observed'))
# Plot predicted values
fig.add_trace(go.Scatter(x=test_forecast['ds'], y=test_forecast['yhat'], mode='lines', name='Predicted'))
# Update plot layout
fig.update_layout(
title="Observed vs Predicted Time Series (Prophet)",
xaxis_title="Date",
yaxis_title="Value",
template="plotly_white"
)
fig.show()
10:12:08 - cmdstanpy - INFO - Chain [1] start processing 10:12:08 - cmdstanpy - INFO - Chain [1] done processing
Test MSE: 26.13574676458783 Test MAPE: 0.23392506720077097
#### Multiplicative seasonality mode (performs worse here)
# Create a DataFrame with columns 'ds' and 'y' for Prophet
df = pd.DataFrame({"ds": df_1.index, "y":df_1["store1_item_1"]})
## Creating model parameters
model_param ={
"daily_seasonality": False,
"weekly_seasonality":True,
"yearly_seasonality":False,
"seasonality_mode": "multiplicative",
"growth": "logistic"
}
df['cap'] = df["y"].max() + df["y"].std() * 0.05
# Split the data into train and test sets
train_size = int(len(df) * 0.8)
train_df = df[:train_size]
test_df = df[train_size:]
# Initialize and fit the Prophet model
model = Prophet(**model_param)
model.fit(train_df)
# Make future predictions
future = model.make_future_dataframe(periods=len(test_df), freq="D")
future['cap'] = df["y"].max() + df["y"].std() * 0.05  # use the same capacity as in training
forecast = model.predict(future)
# Calculate the MSE for the test set
test_forecast = forecast[-len(test_df):]
error1 = mean_squared_error(test_df['y'], test_forecast['yhat'])
error2=mean_absolute_percentage_error(test_df['y'], test_forecast['yhat'])
print(f"Test MSE: {error1}")
print(f"Test MAPE: {error2}")
# Plot observed and predicted values
fig = go.Figure()
# Plot observed (train + test) values
fig.add_trace(go.Scatter(x=df['ds'], y=df['y'], mode='lines', name='Observed'))
# Plot predicted values
fig.add_trace(go.Scatter(x=test_forecast['ds'], y=test_forecast['yhat'], mode='lines', name='Predicted'))
# Update plot layout
fig.update_layout(
title="Observed vs Predicted Time Series (Prophet)",
xaxis_title="Date",
yaxis_title="Value",
template="plotly_white"
)
fig.show()
10:12:52 - cmdstanpy - INFO - Chain [1] start processing 10:12:52 - cmdstanpy - INFO - Chain [1] done processing
Test MSE: 41.926852277876336 Test MAPE: 0.29944857764618943
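Collecting the test MSEs printed across the experiments above makes the comparison explicit (values copied from the outputs in this notebook):

```python
results_mse = {
    "AR(7)": 54.08,
    "Prophet, defaults": 25.12,
    "Prophet, logistic growth": 26.17,
    "Prophet, yearly seasonality + US holidays": 26.16,
    "Prophet, quarterly seasonality + US holidays": 27.13,
    "Prophet, monthly seasonality + US holidays": 26.40,
    "Prophet, weekly seasonality + US holidays": 26.14,
    "Prophet, multiplicative mode": 41.93,
}
best = min(results_mse, key=results_mse.get)
print(best, results_mse[best])  # Prophet, defaults 25.12
```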
Conclusion¶
Even though the time series appears stationary and the ACF and PACF plots suggest an AR(7) model, Prophet outperforms the AR model on the test set. Prophet's flexibility in capturing seasonality, holidays, and trend changes enables it to deliver more accurate forecasts in this case, making it a powerful alternative to traditional AR models for time series forecasting.